[Day 4] Data Preprocessing: Image Classification and Cropping

2021 iThome Ironman Challenge

DAY 4

AI & Data

Handwritten Chinese Character Image Recognition series, part 4

13th Ironman Challenge

Ethan Chen

2021-09-19 23:16:07

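Current Status

1. After using a YOLOv4 model to box the Chinese characters, the dataset (about 70,000 images) was split into the following categories:

1.1 word (exactly 1 Chinese character)

1.2 words (2 or more Chinese characters)

1.3 no_word (no text)

2. Because "in the official competition, each image contains only one most-correct Chinese character", the images with exactly 1 Chinese character were selected as the new dataset.

3. After selecting the word images (exactly 1 Chinese character), the following problems remained.

3.1 Images with only 1 Chinese character still have large blank areas outside the detection box, so opencv-python is used to remove the blank background.

3.2 Many of the cropped images carry wrong labels (for example, the character in the image is 「鴻」 but the label is 「卓」).

3.3 After the labels were corrected, some label names are not among the 800 characters provided by the organizer.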

Tools/Packages

opencv-python
shutil
numpy

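Content

1. Image classification

1.1 Object detection: load the YOLOv4 model to box the Chinese characters and return the detection boxes (boxes). len(boxes) gives the number of Chinese characters boxed in that image.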

import cv2
import numpy as np
import os
import shutil

# Load the model config and trained weights
def initNet():
    CONFIG = 'yolov4-tiny-myobj.cfg'
    WEIGHT = 'yolov4-tiny-myobj_last.weights'

    net   = cv2.dnn.readNet(CONFIG,WEIGHT)
    model = cv2.dnn_DetectionModel(net)
    model.setInputParams(size=(416,416),scale=1/255.0)
    model.setInputSwapRB(True)
    return model

# Object detection
def nnProcess(image, model):
    # 0.4 is the confidence threshold, 0.1 is the NMS threshold
    classes, confs, boxes = model.detect(image, 0.4, 0.1)
    return classes, confs, boxes

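1.2 Image classification

To avoid modifying the original dataset, the classified files are copied to new folders with shutil.

Images are classified by the number of boxed Chinese characters (box_num), as in the code below.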

# Classify by the number of detected objects
# l, m, n are the running counts (no text / 1 character / 2 or more) passed in by the caller
def copyClassify(file, input, boxes, file_name, l, m, n):
    box_num = len(boxes)
    if box_num == 0:
        shutil.copy2(input, './02_yolo_classify3/03_no_word/{}'.format(file_name))
        print('※{} copied to no_word'.format(file))
    elif box_num == 1:
        shutil.copy2(input, './02_yolo_classify3/01_word/{}'.format(file_name))
        print('※{} copied to word'.format(file))
    else:
        shutil.copy2(input, './02_yolo_classify3/02_words/{}'.format(file_name))
        print('※{} copied to words'.format(file))
    print('  No text: {} images'.format(l))
    print('  1 character: {} images'.format(m))
    print('  2 or more characters: {} images'.format(n))
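
The function above prints the counters l, m, n but does not update them, so the caller has to keep them. A minimal sketch of one way to do that, assuming a hypothetical helper pair `category_for` and `classify_copy` (neither is from the original code) and a `Counter` in place of three separate variables. The box counts in the demo are made up for illustration:

```python
import shutil
import tempfile
from collections import Counter
from pathlib import Path


def category_for(box_num):
    """Map a detected-box count to the folder names used in this post."""
    if box_num == 0:
        return '03_no_word'
    if box_num == 1:
        return '01_word'
    return '02_words'


def classify_copy(src_path, box_num, out_root, counts):
    """Copy src_path into the matching category folder and update counts."""
    category = category_for(box_num)
    dest_dir = Path(out_root) / category
    dest_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src_path, dest_dir / Path(src_path).name)
    counts[category] += 1
    return category


# Demo on throwaway files; the box counts here are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    src_dir = Path(tmp) / 'src'
    src_dir.mkdir()
    counts = Counter()
    for name, box_num in [('1_a.jpg', 0), ('2_b.jpg', 1), ('3_c.jpg', 3)]:
        path = src_dir / name
        path.write_bytes(b'fake image bytes')
        classify_copy(path, box_num, Path(tmp) / 'out', counts)
    demo_counts = dict(counts)
print(demo_counts)
```

In a real run, box_num would be len(boxes) from nnProcess, and the originals stay untouched because only copies are written.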

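2. Saving/reading images

When saving an image with opencv-python, if the output path contains Chinese characters, use cv2.imencode( ) and write the encoded buffer with tofile( ). (cv2.imwrite only worked with English paths.)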

# Save a preprocessed image (path containing Chinese characters)
def saveClassify(image, output, p):
    cv2.imencode(ext='.jpg', img=image)[1].tofile(output)
    print('Image {} boxed and saved successfully'.format(p))

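When reading an image whose path contains Chinese characters, likewise use cv2.imdecode( ) on the bytes loaded with np.fromfile( ). (cv2.imread only worked with English paths.)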

# Read an image (path containing Chinese characters); -1 is cv2.IMREAD_UNCHANGED
cv2.imdecode(np.fromfile(img_path, dtype=np.uint8), -1)

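3. Image cropping

3.1 Cropping: the detection box is drawn with a 2px line, so when cropping, take care that the box does not extend outside the image.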

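# Draw a box around each detected object and crop it
def drawBox(image, classes, confs, boxes):
    new_image = image.copy()
    cut_img_list = []
    for (classid, conf, box) in zip(classes, confs,boxes):
        x, y, w, h = box
        # Keep x and y from going outside the image
        if x - 2 < 0:
            x = 2
        if y - 2 < 0:
            y = 2
        # Draw the detection box
        cv2.rectangle(new_image, (x - 2, y - 2), (x + w + 2, y + h + 2), (0, 255, 0), 2)
        # Crop the Chinese character inside the detection box
        # (crop from the image argument, not a global variable)
        cut_img = image[y:y + h + 2, x:x + w + 2]
        cut_img_list.append(cut_img)
    # Raises IndexError when nothing was detected; the caller handles that case
    return new_image, cut_img_list[0]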
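
The function above only guards the left and top edges. As a sketch of a slightly different approach (not the original code), the padded box can be clipped on all four sides before slicing. The helper name `clamp_box` and its `pad` parameter are assumptions introduced for illustration:

```python
def clamp_box(x, y, w, h, img_w, img_h, pad=2):
    """Expand a box by pad pixels and clip it to the image bounds.

    Returns (x1, y1, x2, y2), suitable for image[y1:y2, x1:x2].
    """
    x1 = max(x - pad, 0)
    y1 = max(y - pad, 0)
    x2 = min(x + w + pad, img_w)
    y2 = min(y + h + pad, img_h)
    return x1, y1, x2, y2


inside = clamp_box(10, 20, 30, 40, img_w=100, img_h=100)
at_edge = clamp_box(1, 0, 98, 99, img_w=100, img_h=100)
print(inside)   # (8, 18, 42, 62)
print(at_edge)  # (0, 0, 100, 100)
```

With an OpenCV image, img_h and img_w would come from image.shape[:2].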

3.2 When saved, the cropped images overwrite the new dataset (the classified images).

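if __name__ == '__main__':
    # Dataset provided by the organizer (about 70,000 images)
    source = './01_origin/'
    files = os.listdir(source)
    # Sort numerically by the integer prefix (x[:-6] drops the last 6 characters)
    files.sort(key=lambda x:int(x[:-6]))
    model = initNet()
    # p is the running image number used in the save message
    for p, file in enumerate(files, start=1):
        img = cv2.imdecode(np.fromfile(source+file,dtype=np.uint8), -1)
        classes, confs, boxes = nnProcess(img, model)
        try:
            frame, cut = drawBox(img, classes, confs, boxes)
            # Image with the detection box drawn
            frame = cv2.resize(frame, (240, 200), interpolation=cv2.INTER_CUBIC)
            # Show the boxed image
            cv2.imshow('img', frame)
            # Cropped image
            cut2 = cv2.resize(cut, (80, 60), interpolation=cv2.INTER_CUBIC)
            cv2.imshow('cut', cut2)
            cv2.waitKey()
            saveClassify(cut2, './02_yolo_classify3/cut2/' + file, p) # Save the cropped image
        except Exception:
            # Skip images that cannot be processed (for example, no box detected)
            continue
    print('Program finished')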

3.3 Results

Before cropping
After cropping

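4. Label errors:
- "Before you reach artificial intelligence, you inevitably go through manual-labor intelligence first." With many teammates on hand, we checked the label of every image one by one and corrected the wrong ones by hand.
- There were a full 66,000 images. Besides spending a great deal of time checking and correcting, teammates could also get dizzy and misread a character, so the process was inefficient. (Pain level: 300 points)
- If you know a better technique for correcting labels, please leave a comment. Thank you!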

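5. Labels outside the 800 characters

5.1 The 800-character dictionary (txt file)

5.2 Check whether each label is among the 800 characters, as in the code below.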

import os
import shutil

# Read the txt file and split it on newlines
def read_dicts(path):
    with open(path, 'rt', encoding="utf-8") as file1:
        words = file1.read().split('\n')
    return words

# Check whether each file's label is in the dictionary
def check_in_dicts(source, words):
    files = os.listdir(source)
    files.sort(key=lambda x:int(x[:-6]))
    move_record = []
    print('※Checking whether labels are in the dictionary...')
    for file in files:
        # file[-5:-4] is the single character just before the 4-character extension
        if file[-5:-4] in words:
            print('{} is in the dictionary'.format(file))
        else:
            print('{} is not in the dictionary'.format(file))
            move_record.append(file)
    print('Check finished')
    return move_record

# Move files to the destination folder
def move_to_des(move_record, source, destination):
    print('※Moving files to the destination folder')
    for move_it in move_record:
        shutil.move(source+move_it, destination)
        print('{} moved to folder: other characters'.format(move_it))
    print('Move finished')

if __name__ == '__main__':
    # training data dic.txt
    dics = './data/training data dic.txt'
    # Folder to check
    source = './data/04_清洗標籤後圖片/origin/'
    # Destination folder
    destination = './data/04_清洗標籤後圖片/800字外/'
    words = read_dicts(dics)
    move_record = check_in_dicts(source, words)
    move_to_des(move_record, source, destination)
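
The fixed slices x[:-6] and file[-5:-4] depend on a one-character label and a 4-character extension. A minimal sketch of a less position-dependent variant, assuming file names follow the pattern `<number>_<label>.<ext>` (inferred from those slices, not stated in the post). The helpers `split_name` and `find_outside_dictionary` are hypothetical, and a set is used for the dictionary lookup:

```python
import os


def split_name(file_name):
    """Split '<index>_<label>.<ext>' into (index, label)."""
    stem, _ext = os.path.splitext(file_name)
    index, label = stem.rsplit('_', 1)
    return int(index), label


def find_outside_dictionary(file_names, words):
    """Return the file names whose label is not in words, sorted by index."""
    allowed = set(words)
    ordered = sorted(file_names, key=lambda name: split_name(name)[0])
    return [name for name in ordered if split_name(name)[1] not in allowed]


names = ['10_卓.jpg', '2_鴻.jpg', '1_龘.jpeg']
outside = find_outside_dictionary(names, ['鴻', '卓'])
print(outside)  # ['1_龘.jpeg']
```

The returned list could then be passed to the move step in place of the record built above.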

5.3 Results